{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "datalore": {
     "hide_input_from_viewers": false,
     "hide_output_from_viewers": false,
     "type": "MD"
    }
   },
   "source": [
    "### TAO remote client - Auto Labeling\n",
    "\n",
    "Transfer learning is the process of transferring learned features from one application to another. It is a commonly used training technique where you use a model trained on one task and re-train to use it on a different task. Train Adapt Optimize (TAO) Toolkit  is a simple and easy-to-use Python based AI toolkit for taking purpose-built AI models and customizing them with users' own data.\n",
    "\n",
    "![image](https://d29g4g2dyqv443.cloudfront.net/sites/default/files/akamai/TAO/tlt-tao-toolkit-bring-your-own-model-diagram.png)\n",
    "\n",
    "\n",
    "### The workflow in a nutshell\n",
    "\n",
    "- Creating a dataset\n",
    "- Upload dataset to the service\n",
    "- Getting a PTM from NGC\n",
    "- Model Actions\n",
    "    - Train (Normal/AutoML)\n",
    "    - Evaluate\n",
    "    - Inference on TAO\n",
    "\n",
    "### Table of contents\n",
    "\n",
    "1. [Install TAO remote client ](#head-1)\n",
    "1. [Set the remote service base URL](#head-2)\n",
    "1. [Access the shared volume](#head-3)\n",
    "1. [Create the datasets](#head-4)\n",
    "1. [List datasets](#head-5)\n",
    "1. [Create an experiment](#head-8)\n",
    "1. [Find pretrained model](#head-9)\n",
    "1. [Customize model metadata](#head-10)\n",
    "1. [View hyperparameters that are enabled for AutoML by default](#head-11)\n",
    "1. [Set AutoML related configurations](#head-12)\n",
    "1. [Provide train specs](#head-13)\n",
    "1. [Run train](#head-14)\n",
    "1. [View checkpoint files](#head-15)\n",
    "1. [Provide evaluate specs](#head-16)\n",
    "1. [Run evaluate](#head-17)\n",
    "1. [Provide TAO inference specs](#head-28)\n",
    "1. [Run TAO inference](#head-29)\n",
    "1. [Delete experiment](#head-32)\n",
    "1. [Delete datasets](#head-33)\n",
    "1. [Unmount shared volume](#head-34)\n",
    "1. [Uninstall TAO Remote Client](#head-35)\n",
    "\n",
    "### Requirements\n",
    "Please find the server requirements [here](https://docs.nvidia.com/tao/tao-toolkit/text/tao_toolkit_api/api_setup.html#)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "datalore": {
     "hide_input_from_viewers": false,
     "hide_output_from_viewers": false,
     "type": "CODE"
    },
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "import os\n",
    "import glob\n",
    "import subprocess\n",
    "import getpass\n",
    "import ast\n",
    "import json\n",
    "import time\n",
    "from IPython.display import clear_output"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "namespace = 'default'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### FIXME\n",
    "\n",
    "1. Assign the ip_address and port_number in FIXME 1 and FIXME 2 ([info](https://docs.nvidia.com/tao/tao-toolkit/text/tao_toolkit_api/api_rest_api.html))\n",
    "1. Assign the ngc_api_key variable in FIXME 3\n",
    "1. (Optional) Enable AutoML if needed in FIXME 4\n",
    "1. Choose between bayesian and hyperband automl_algorithm in FIXME 5 (If automl was enabled in FIXME4)\n",
    "1. Choose to download jobs or not in FIXME 6\n",
    "1. Choose between default and custom dataset in FIXME 7\n",
    "1. Assign path of DATA_DIR in FIXME 8"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model_name = \"mal\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "datalore": {
     "hide_input_from_viewers": false,
     "hide_output_from_viewers": false,
     "type": "MD"
    }
   },
   "source": [
    "### Install TAO remote client <a class=\"anchor\" id=\"head-1\"></a>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "datalore": {
     "hide_input_from_viewers": false,
     "hide_output_from_viewers": false,
     "type": "CODE"
    },
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# SKIP this step IF you have already installed the TAO-Client wheel.\n",
    "! pip3 install nvidia-transfer-learning-client"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# View the version of the TAO-Client\n",
    "! nvtl --version"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "datalore": {
     "hide_input_from_viewers": false,
     "hide_output_from_viewers": false,
     "type": "MD"
    }
   },
   "source": [
    "### Set the remote service base URL <a class=\"anchor\" id=\"head-2\"></a>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# Define the node_addr and port number\n",
    "workdir = \"workdir_auto_labeling\" # FIXME1\n",
    "host_url = \"http://<ip_address>:<port_number>\" # FIXME2 example: https://10.137.149.22:32334\n",
    "# In host machine, node ip_address and port number can be obtained as follows,\n",
    "# ip_address: hostname -i\n",
    "# port_number: kubectl get service ingress-nginx-controller -o jsonpath='{.spec.ports[0].nodePort}'\n",
    "\n",
    "ngc_api_key = \"<ngc_api_key>\" # FIXME3 example: (Add NGC API key)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "automl_enabled = False # FIXME4 set to True if you want to run automl for the model chosen in the previous cell\n",
    "automl_algorithm = \"bayesian\" # FIXME5 example: bayesian/hyperband\n",
    "# FIXME6 Defaulted to False as downloading jobs from service to your machine takes time\n",
    "# Set to True if you want to download jobs where examples have been provided like for train, export, inference.\n",
    "download_jobs = False"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "datalore": {
     "hide_input_from_viewers": false,
     "hide_output_from_viewers": false,
     "type": "CODE"
    },
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "%env BASE_URL={host_url}/{namespace}/api/v1\n",
    "\n",
    "# Exchange NGC_API_KEY for JWT\n",
    "identity = json.loads(subprocess.getoutput(f'nvtl login --ngc-api-key {ngc_api_key}'))\n",
    "\n",
    "%env USER={identity['user_id']}\n",
    "%env TOKEN={identity['token']}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Function to parse logs <a class=\"anchor\" id=\"head-1.1\"></a>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def my_tail(model_name_cli, id, job_id, job_type, workdir):\n",
    "\tstatus = None\n",
    "\twhile True:\n",
    "\t\ttime.sleep(10)\n",
    "\t\tclear_output(wait=True)\n",
    "\t\tlog_file_path = subprocess.getoutput(f\"nvtl {model_name_cli} get-log-file --id {id} --job {job_id} --job_type {job_type} --workdir {workdir}\")\n",
    "\t\tif not os.path.exists(log_file_path):\n",
    "\t\t\tcontinue\n",
    "\t\twith open(log_file_path, 'rb') as log_file:\n",
    "\t\t\tlog_contents = log_file.read()\n",
    "\t\tlog_content_lines = log_contents.decode(\"utf-8\").split(\"\\n\")        \n",
    "\t\tfor line in log_content_lines:\n",
    "\t\t\tprint(line.strip())\n",
    "\t\t\tif line.strip() == \"Error EOF\":\n",
    "\t\t\t\tstatus = \"Error\"\n",
    "\t\t\t\tbreak\n",
    "\t\t\telif line.strip() == \"Done EOF\":\n",
    "\t\t\t\tstatus = \"Done\"\n",
    "\t\t\t\tbreak\n",
    "\t\tif status is not None:\n",
    "\t\t\tbreak\n",
    "\treturn status"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Function to split tar files <a class=\"anchor\" id=\"head-1.1\"></a>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import tarfile\n",
    "\n",
    "def split_tar_file(input_tar_path, output_dir, max_split_size=0.2*1024*1024*1024):\n",
    "\tos.makedirs(output_dir, exist_ok=True)\n",
    "\t\n",
    "\twith tarfile.open(input_tar_path, 'r') as original_tar:\n",
    "\t\tmembers = original_tar.getmembers()\n",
    "\t\tcurrent_split_size = 0\n",
    "\t\tcurrent_split_number = 0\n",
    "\t\tcurrent_split_name = os.path.join(output_dir, f'smaller_file_{current_split_number}.tar')\n",
    "\t\t\n",
    "\t\twith tarfile.open(current_split_name, 'w') as split_tar:\n",
    "\t\t\tfor member in members:\n",
    "\t\t\t\tif current_split_size + member.size <= max_split_size:\n",
    "\t\t\t\t\tsplit_tar.addfile(member, original_tar.extractfile(member))\n",
    "\t\t\t\t\tcurrent_split_size += member.size\n",
    "\t\t\t\telse:\n",
    "\t\t\t\t\tsplit_tar.close()\n",
    "\t\t\t\t\tcurrent_split_number += 1\n",
    "\t\t\t\t\tcurrent_split_name = os.path.join(output_dir, f'smaller_file_{current_split_number}.tar')\n",
    "\t\t\t\t\tcurrent_split_size = 0\n",
    "\t\t\t\t\tsplit_tar = tarfile.open(current_split_name, 'w')  # Open a new split tar archive\n",
    "\t\t\t\t\tsplit_tar.addfile(member, original_tar.extractfile(member))\n",
    "\t\t\t\t\tcurrent_split_size += member.size"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "datalore": {
     "hide_input_from_viewers": false,
     "hide_output_from_viewers": false,
     "type": "MD"
    }
   },
   "source": [
    "### Create the datasets <a class=\"anchor\" id=\"head-4\"></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "datalore": {
     "hide_input_from_viewers": false,
     "hide_output_from_viewers": false,
     "type": "MD"
    }
   },
   "source": [
    "We will be using the `COCO dataset`. `download_coco.sh` script from dataset prepare will be used to download and unzip the [coco2017 dataset](https://cocodataset.org/#download)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**If using custom dataset; it should follow this dataset structure**\n",
    "```\n",
    "DATA_DIR/train2017\n",
    "├── annotations.json\n",
    "├── images\n",
    "    ├── image_name_1.jpg\n",
    "    ├── image_name_2.jpg\n",
    "    ├── ...\n",
    "\n",
    "```\n",
    "```\n",
    "DATA_DIR/val2017\n",
    "├── annotations.json\n",
    "├── images\n",
    "    ├── image_name_1.jpg\n",
    "    ├── image_name_2.jpg\n",
    "    ├── ...\n",
    "\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "dataset_to_be_used = \"default\" #FIXME7 example: default/custom; default for the dataset used in this tutorial notebook; custom for a different dataset\n",
    "DATA_DIR = model_name # FIXME8\n",
    "os.environ['DATA_DIR']= DATA_DIR\n",
    "!mkdir -p $DATA_DIR\n",
    "job_map = {}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Download dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true,
    "tags": []
   },
   "outputs": [],
   "source": [
    "if dataset_to_be_used == \"default\":\n",
    "    !bash ../dataset_prepare/coco/download_coco.sh $DATA_DIR\n",
    "    # Remove existing data\n",
    "    !rm -rf $DATA_DIR/train2017/images\n",
    "    !rm -rf $DATA_DIR/val2017/images\n",
    "    # Rearrange data in the required format\n",
    "    !mkdir -p $DATA_DIR/train2017/\n",
    "    !mkdir -p $DATA_DIR/val2017/\n",
    "    !mv $DATA_DIR/raw-data/train2017 $DATA_DIR/train2017/images\n",
    "    !mv $DATA_DIR/raw-data/annotations/instances_train2017.json $DATA_DIR/train2017/annotations.json\n",
    "    !mv $DATA_DIR/raw-data/val2017 $DATA_DIR/val2017/images\n",
    "    !mv $DATA_DIR/raw-data/annotations/instances_val2017.json $DATA_DIR/val2017/annotations.json"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Verify the downloaded dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "!if [ ! -d $DATA_DIR/train2017/images ]; then echo 'Train Images folder not found'; else echo 'Found Train images folder';fi\n",
    "!if [ ! -f $DATA_DIR/train2017/annotations.json ]; then echo 'Train annotations file not found'; else echo 'Found Train annotations file';fi\n",
    "!if [ ! -d $DATA_DIR/val2017/images ]; then echo 'Val Images folder not found'; else echo 'Found Val images folder';fi\n",
    "!if [ ! -f $DATA_DIR/val2017/annotations.json ]; then echo 'Val annotations file not found'; else echo 'Found Val annotations file';fi"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!tar -C $DATA_DIR/train2017 -czf $DATA_DIR/coco_train.tar.gz images annotations.json\n",
    "!tar -C $DATA_DIR/val2017 -czf $DATA_DIR/coco_val.tar.gz images annotations.json"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_dataset_path = f\"{DATA_DIR}/coco_train.tar.gz\"\n",
    "eval_dataset_path = f\"{DATA_DIR}/coco_val.tar.gz\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "ds_type = \"instance_segmentation\"\n",
    "ds_format = \"coco\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "datalore": {
     "hide_input_from_viewers": false,
     "hide_output_from_viewers": false,
     "type": "CODE"
    },
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "train_dataset_id = subprocess.getoutput(f\"nvtl {model_name} dataset-create --dataset_type {ds_type} --dataset_format {ds_format}\")\n",
    "print(train_dataset_id)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "output_dir = os.path.join(os.path.dirname(os.path.abspath(train_dataset_path)), model_name, \"train\")\n",
    "split_tar_file(train_dataset_path, output_dir)\n",
    "for idx, tar_dataset_path in enumerate(os.listdir(output_dir)):\n",
    "    print(f\"Uploading {idx+1}/{len(os.listdir(output_dir))} tar split\")\n",
    "    upload_train_dataset_message = subprocess.getoutput(f\"nvtl {model_name} dataset-upload --id {train_dataset_id} --path {os.path.join(output_dir,tar_dataset_path)}\")\n",
    "    print(upload_train_dataset_message)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "datalore": {
     "hide_input_from_viewers": false,
     "hide_output_from_viewers": false,
     "type": "CODE"
    },
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "eval_dataset_id = subprocess.getoutput(f\"nvtl {model_name} dataset-create --dataset_type {ds_type} --dataset_format {ds_format}\")\n",
    "print(eval_dataset_id)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "output_dir = os.path.join(os.path.dirname(os.path.abspath(eval_dataset_path)), model_name, \"val\")\n",
    "split_tar_file(eval_dataset_path, output_dir)\n",
    "for idx, tar_dataset_path in enumerate(os.listdir(output_dir)):\n",
    "    print(f\"Uploading {idx+1}/{len(os.listdir(output_dir))} tar split\")\n",
    "    upload_val_dataset_message = subprocess.getoutput(f\"nvtl {model_name} dataset-upload --id {eval_dataset_id} --path {os.path.join(output_dir,tar_dataset_path)}\")\n",
    "    print(upload_val_dataset_message)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "datalore": {
     "hide_input_from_viewers": false,
     "hide_output_from_viewers": false,
     "type": "MD"
    }
   },
   "source": [
    "### List datasets <a class=\"anchor\" id=\"head-5\"></a>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "datalore": {
     "hide_input_from_viewers": false,
     "hide_output_from_viewers": false,
     "type": "CODE"
    },
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "message = subprocess.getoutput(f\"nvtl {model_name} list-datasets\")\n",
    "message = ast.literal_eval(message)\n",
    "for rsp in message:\n",
    "    rsp_keys = rsp.keys()\n",
    "    assert \"id\" in rsp_keys\n",
    "    assert \"type\" in rsp_keys\n",
    "    assert \"format\" in rsp_keys\n",
    "    assert \"name\" in rsp_keys\n",
    "    print(rsp[\"id\"],\"\\t\",rsp[\"type\"],\"\\t\",rsp[\"format\"],\"\\t\\t\",rsp[\"name\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "datalore": {
     "hide_input_from_viewers": false,
     "hide_output_from_viewers": false,
     "type": "MD"
    }
   },
   "source": [
    "### Create an experiment <a class=\"anchor\" id=\"head-8\"></a>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "datalore": {
     "hide_input_from_viewers": false,
     "hide_output_from_viewers": false,
     "type": "CODE"
    },
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "network_arch = model_name\n",
    "experiment_id = subprocess.getoutput(f\"nvtl {model_name} experiment-create --network_arch {network_arch} --encryption_key tlt_encode \")\n",
    "print(experiment_id)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Assign train, eval datasets <a class=\"anchor\" id=\"head-10\"></a>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "docker_env_vars = {} # Update any variables to be included while triggering Docker run-time like MLOPs variables \n",
    "dataset_information = {\"train_datasets\":[train_dataset_id],\n",
    "                       \"eval_dataset\":eval_dataset_id,\n",
    "                       \"inference_dataset\":eval_dataset_id,\n",
    "                       \"docker_env_vars\": docker_env_vars,\n",
    "                       \"metric\": \"train_loss\"}\n",
    "patched_model = subprocess.getoutput(f\"nvtl {model_name} patch-artifact-metadata --id {experiment_id} --job_type experiment --update_info '{json.dumps(dataset_information)}' \")\n",
    "print(patched_model)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "datalore": {
     "hide_input_from_viewers": false,
     "hide_output_from_viewers": false,
     "type": "MD"
    }
   },
   "source": [
    "### Find pretrained model <a class=\"anchor\" id=\"head-9\"></a>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# List all pretrained models for the chosen network architecture\n",
    "filter_params = {\"network_arch\": network_arch}\n",
    "message = subprocess.getoutput(f\"nvtl {model_name} list-experiments --filter_params '{json.dumps(filter_params)}'\")\n",
    "message = ast.literal_eval(message)\n",
    "for rsp in message:\n",
    "    rsp_keys = rsp.keys()\n",
    "    if \"encryption_key\" not in rsp.keys():\n",
    "        assert \"name\" in rsp_keys and \"version\" in rsp_keys and \"ngc_path\" in rsp_keys and \"additional_id_info\" in rsp_keys\n",
    "        print(f'PTM Name: {rsp[\"name\"]}; PTM version: {rsp[\"version\"]}; NGC PATH: {rsp[\"ngc_path\"]}; Additional info: {rsp[\"additional_id_info\"]}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "datalore": {
     "hide_input_from_viewers": false,
     "hide_output_from_viewers": false,
     "type": "MD"
    }
   },
   "source": [
    "### Assign PTM <a class=\"anchor\" id=\"head-10\"></a>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# Assigning pretrained models\n",
    "# From the output of previous cell make the appropriate changes to this map if you want to change the default PTM backbone.\n",
    "# Changing the default backbone here requires changing default spec/config during train/eval etc like for example\n",
    "# If you are changing the ptm to resnet34, then you have to modify the config key num_layers if it exists to 34 manually\n",
    "pretrained_map = {\"mal\" : \"mask_auto_label:trainable_v1.0\"}\n",
    "no_ptm_models = set([])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "datalore": {
     "hide_input_from_viewers": false,
     "hide_output_from_viewers": false,
     "type": "CODE"
    },
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "if network_arch not in no_ptm_models:\n",
    "    filter_params = {\"network_arch\": network_arch}\n",
    "    message = subprocess.getoutput(f\"nvtl {model_name} list-experiments --filter_params '{json.dumps(filter_params)}'\")\n",
    "    message = ast.literal_eval(message)\n",
    "    ptm = []\n",
    "    for rsp in message:\n",
    "        rsp_keys = rsp.keys()\n",
    "        assert \"ngc_path\" in rsp_keys\n",
    "        if rsp[\"ngc_path\"].endswith(pretrained_map[network_arch]):\n",
    "            assert \"id\" in rsp_keys\n",
    "            ptm_id = rsp[\"id\"]\n",
    "            ptm = [ptm_id]\n",
    "            print(\"Metadata for model with requested NGC Path\")\n",
    "            print(rsp)\n",
    "            break\n",
    "    print(ptm)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "datalore": {
     "hide_input_from_viewers": false,
     "hide_output_from_viewers": false,
     "type": "CODE"
    },
    "scrolled": true,
    "tags": []
   },
   "outputs": [],
   "source": [
    "if network_arch not in no_ptm_models:\n",
    "    ptm_information = {\"base_experiment\":ptm}\n",
    "    patched_model = subprocess.getoutput(f\"nvtl {model_name} patch-artifact-metadata --id {experiment_id} --job_type experiment --update_info '{json.dumps(ptm_information)}' \")\n",
    "    print(patched_model)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### View hyperparameters that are enabled for AutoML by default <a class=\"anchor\" id=\"head-11\"></a>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "if automl_enabled:\n",
    "    # View default automl specs enabled\n",
    "    ! nvtl {model_name} model-automl-defaults --id {experiment_id}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "datalore": {
     "hide_input_from_viewers": false,
     "hide_output_from_viewers": false,
     "type": "MD"
    }
   },
   "source": [
    "### Set AutoML related configurations <a class=\"anchor\" id=\"head-12\"></a>\n",
    "Refer to these hyper-links to see the parameters supported by each network and add more parameters if necessary in addition to the default automl enabled parameters:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "datalore": {
     "hide_input_from_viewers": false,
     "hide_output_from_viewers": false,
     "type": "CODE"
    },
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "if automl_enabled:\n",
    "    # Choose any metric that is present in the kpi dictionary present in the model's status.json. \n",
    "    # Example status.json for each model can be found in the respective section in NVIDIA TAO DOCS here: https://docs.nvidia.com/tao/tao-toolkit/text/model_zoo/cv_models/index.html\n",
    "    metric=\"kpi\"\n",
    "\n",
    "    additional_automl_parameters = [] #Refer to parameter list mentioned in the above links and add any extra parameter in addition to the default enabled ones\n",
    "    remove_default_automl_parameters = [] #Remove any hyperparameters that are enabled by default for AutoML\n",
    "\n",
    "    automl_information = {\"automl_enabled\":automl_enabled,\n",
    "                          \"automl_algorithm\":automl_algorithm,\n",
    "                          \"automl_max_recommendations\": 20, # Only for bayesian\n",
    "                          \"automl_R\": 27, # Only for hyperband\n",
    "                          \"automl_nu\": 3, # Only for hyperband\n",
    "                          \"epoch_multiplier\": 1, # Only for hyperband\n",
    "                          \"metric\":metric,\n",
    "                          \"automl_add_hyperparameters\":str(additional_automl_parameters),\n",
    "                          \"automl_remove_hyperparameters\":str(remove_default_automl_parameters)\n",
    "                         }\n",
    "    patched_model = subprocess.getoutput(f\"nvtl {model_name} patch-artifact-metadata --id {experiment_id} --job_type experiment --update_info '{json.dumps(automl_information)}' \")\n",
    "    patched_model = json.loads(patched_model)\n",
    "    print(json.dumps(patched_model, indent=4))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "datalore": {
     "hide_input_from_viewers": false,
     "hide_output_from_viewers": false,
     "type": "MD"
    }
   },
   "source": [
    "### Provide train specs <a class=\"anchor\" id=\"head-13\"></a>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "datalore": {
     "hide_input_from_viewers": false,
     "hide_output_from_viewers": false,
     "type": "CODE"
    },
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# Default train model specs\n",
    "train_specs = subprocess.getoutput(f\"nvtl {model_name} get-spec --action train --job_type experiment --id {experiment_id}\")\n",
    "train_specs = json.loads(train_specs)\n",
    "print(json.dumps(train_specs, indent=4))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# Customize train model specs\n",
    "train_specs[\"gpu_ids\"] = [0]\n",
    "train_specs[\"train\"][\"num_epochs\"] = 5\n",
    "print(json.dumps(train_specs, indent=4))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "datalore": {
     "hide_input_from_viewers": false,
     "hide_output_from_viewers": false,
     "type": "MD"
    }
   },
   "source": [
    "### Run train <a class=\"anchor\" id=\"head-14\"></a>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "datalore": {
     "hide_input_from_viewers": false,
     "hide_output_from_viewers": false,
     "type": "CODE"
    },
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "job_id = subprocess.getoutput(f\"nvtl {model_name} experiment-run-action --action train --id {experiment_id} --specs '{json.dumps(train_specs)}'\")\n",
    "job_map[\"train_\" + model_name] = job_id\n",
    "print(job_id)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# Monitor job status\n",
    "if automl_enabled:    \n",
    "    while True:\n",
    "        clear_output(wait=True)\n",
    "        response = subprocess.getoutput(f\"nvtl {model_name} get-action-status --job_type experiment --id {experiment_id} --job {job_id}\")\n",
    "        response = json.loads(response)\n",
    "        if \"error_desc\" in response.keys() and response[\"error_desc\"] in (\"Job trying to retrieve not found\", \"No AutoML run found\"):\n",
    "            print(\"Job is being created\")\n",
    "            time.sleep(5)\n",
    "            continue\n",
    "        print(json.dumps(response, sort_keys=True, indent=4))\n",
    "        assert \"status\" in response.keys() and response.get(\"status\") != \"Error\"\n",
    "        if response.get(\"status\") in [\"Done\",\"Error\"]:\n",
    "            break\n",
    "        time.sleep(15)\n",
    "else:\n",
    "    # Check status (the file won't exist until the backend Toolkit container is running -- can take several minutes)\n",
    "    status = my_tail(model_name, experiment_id, job_id, \"experiment\", workdir)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "## To Stop an AutoML JOB\n",
    "#    1. Stop the 'Monitor job status' cell (the cell right before this cell) manually\n",
    "#    2. Uncomment the snippet in the next cell and run the cell"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# if automl_enabled:\n",
    "#     job_id = job_map[\"train_\" + model_name]\n",
    "#     job_id = subprocess.getoutput(f\"nvtl {model_name} job-cancel --job_type experiment --id {experiment_id} --job {job_id}\")\n",
    "#     job_map[\"canceled_\" + model_name] = job_id\n",
    "#     print(job_id)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "## Resume AutoML"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# Uncomment the below snippet if you want to resume an already stopped AutoML job and then run the 'Monitor job status' cell above (4th cell above from this cell)\n",
    "# if automl_enabled:\n",
    "#     job_id = job_map[\"train_\" + model_name]\n",
    "#     job_id = subprocess.getoutput(f\"nvtl {model_name} job-resume --id {experiment_id} --job {job_id}  --parent_job_id {parent}  --specs '{json.dumps(train_specs)}'\")\n",
    "#     job_map[\"resumed_\" + model_name] = job_id\n",
    "#     print(job_id)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "### Viewing checkpoint files <a class=\"anchor\" id=\"head-15\"></a>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "job_id = job_map[\"train_\" + model_name]\n",
    "file_list = subprocess.getoutput(f\"nvtl {model_name} list-job-files --id {experiment_id} --job {job_id} --job_type experiment --retrieve_logs True --retrieve_specs False\")\n",
    "print(file_list)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "## Patch the model with proper metric before training to run this cell; By default loss is used, but some models dont log the parameter under the name 'loss'\n",
    "# file_lists = []\n",
    "# temptar = subprocess.getoutput(f\"nvtl {model_name} download-selective-files --id {experiment_id} --job {job_id} --job_type experiment --workdir {workdir} --file_lists '{file_lists}' --best_model False --latest_model True --tar_files True\")\n",
    "# tar_command = f'tar -xvf {temptar} -C {workdir}/'\n",
    "# os.system(tar_command)\n",
    "# os.remove(temptar)\n",
    "# print(f\"Results at {workdir}/{job_id}\")\n",
    "# model_downloaded_path = f\"{workdir}/{job_id}\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Downloading train job takes a longer time, uncomment this cell if you want to still proceed\n",
    "if download_jobs:\n",
    "    temptar = subprocess.getoutput(f\"nvtl {model_name} download-entire-job --id {experiment_id} --job {job_id} --job_type experiment --workdir {workdir}\")\n",
    "    tar_command = f'tar -xvf {temptar} -C {workdir}/'\n",
    "    os.system(tar_command)\n",
    "    os.remove(temptar)\n",
    "    print(f\"Results at {workdir}/{job_id}\")\n",
    "    model_downloaded_path = f\"{workdir}/{job_id}\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# View the checkpoints generated for the training job and for automl jobs, in addition view: best performing model's config and the results of all automl experiments\n",
    "\n",
    "if download_jobs:\n",
    "    if automl_enabled:\n",
    "        !python3 -m pip install pandas==1.5.1\n",
    "        import pandas as pd\n",
    "        model_downloaded_path = f\"{model_downloaded_path}/best_model\"\n",
    "        assert glob.glob(f\"{model_downloaded_path}/*.protobuf\") or glob.glob(f\"{model_downloaded_path}/*.yaml\")\n",
    "\n",
    "    assert os.path.exists(model_downloaded_path)\n",
    "    assert (glob.glob(model_downloaded_path + \"/**/*.tlt\", recursive=True) + glob.glob(model_downloaded_path + \"/**/*.hdf5\", recursive=True) + glob.glob(model_downloaded_path + \"/**/*.pth\", recursive=True))\n",
    "\n",
    "    if os.path.exists(model_downloaded_path):        \n",
    "        #List the binary model file\n",
    "        print(\"\\nCheckpoints for the training experiment\")\n",
    "        if os.path.exists(model_downloaded_path+\"/train/weights\") and len(os.listdir(model_downloaded_path+\"/train/weights\")) > 0:\n",
    "            print(f\"Folder: {model_downloaded_path}/train/weights\")\n",
    "            print(\"Files:\", os.listdir(model_downloaded_path+\"/train/weights\"))\n",
    "        elif os.path.exists(model_downloaded_path+\"/weights\") and len(os.listdir(model_downloaded_path+\"/weights\")) > 0:\n",
    "            print(f\"Folder: {model_downloaded_path}/weights\")\n",
    "            print(\"Files:\", os.listdir(model_downloaded_path+\"/weights\"))\n",
    "        else:\n",
    "            print(f\"Folder: {model_downloaded_path}\")\n",
    "            print(\"Files:\", os.listdir(model_downloaded_path))\n",
    "\n",
    "        if automl_enabled:\n",
    "            assert glob.glob(f\"{model_downloaded_path}/*.protobuf\") or glob.glob(f\"{model_downloaded_path}/*.yaml\")\n",
    "            experiment_artifacts = json.load(open(f\"{model_downloaded_path}/controller.json\",\"r\"))\n",
    "            data_frame = pd.DataFrame(experiment_artifacts)\n",
    "            # Print experiment id/number and the corresponding result\n",
    "            print(\"\\nResults of all experiments\")\n",
    "            with pd.option_context('display.max_rows', None, 'display.max_columns', None, 'display.max_colwidth', None):\n",
    "                print(data_frame[[\"id\",\"result\"]])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Provide evaluate specs <a class=\"anchor\" id=\"head-16\"></a>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# Default evaluate model specs\n",
    "eval_specs = subprocess.getoutput(f\"nvtl {model_name} get-spec --action evaluate --job_type experiment --id {experiment_id}\")\n",
    "eval_specs = json.loads(eval_specs)\n",
    "print(json.dumps(eval_specs, indent=4))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# Customize evaluate model specs\n",
    "# Change any spec if you wish\n",
    "print(json.dumps(eval_specs, indent=4))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Run evaluate <a class=\"anchor\" id=\"head-17\"></a>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Print model handler parameters\n",
    "model_parameters = subprocess.getoutput(f\"nvtl {model_name} get-metadata --id {experiment_id} --job_type experiment\")\n",
    "model_parameters = json.loads(model_parameters)\n",
    "update_checkpoint_choosing = {}\n",
    "update_checkpoint_choosing[\"checkpoint_choose_method\"] = model_parameters[\"checkpoint_choose_method\"]\n",
    "update_checkpoint_choosing[\"checkpoint_epoch_number\"] = model_parameters[\"checkpoint_epoch_number\"]\n",
    "print(json.dumps(update_checkpoint_choosing, indent=4))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Change the method by which checkpoint from the parent action is chosen, when parent action is a train/retrain action.\n",
    "# Example for evaluate action below, can be applied in the same way for other actions too\n",
    "update_checkpoint_choosing[\"checkpoint_choose_method\"] = \"latest_model\" # Choose between best_model/latest_model/from_epoch_number\n",
    "# If from_epoch_number is chosen then assign the epoch number to the dictionary key in the format 'from_epoch_number{train_job_id}'\n",
    "# update_checkpoint_choosing[\"checkpoint_epoch_number\"][\"from_epoch_number_c2f76eb7-2a75-4197-9a84-c1547f20c17d\"] = 6\n",
    "\n",
    "patched_model = subprocess.getoutput(f\"nvtl {model_name} patch-artifact-metadata --id {experiment_id} --job_type experiment --update_info '{json.dumps(update_checkpoint_choosing)}'\")\n",
    "patched_model = json.loads(patched_model)\n",
    "print(json.dumps(patched_model, indent=4))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true,
    "tags": []
   },
   "outputs": [],
   "source": [
    "parent = job_map[\"train_\" + model_name]\n",
    "job_id = subprocess.getoutput(f\"nvtl {model_name} experiment-run-action --action evaluate --id {experiment_id} --parent_job_id {parent} --specs '{json.dumps(eval_specs)}'\")\n",
    "job_map[\"eval_\" + model_name] = job_id\n",
    "print(job_id)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true,
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Check status (the file won't exist until the backend Toolkit container is running -- can take several minutes)\n",
    "status = my_tail(model_name, experiment_id, job_id, \"experiment\", workdir)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Provide TAO inference specs <a class=\"anchor\" id=\"head-28\"></a>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# Default inference model specs\n",
    "tao_inference_specs = subprocess.getoutput(f\"nvtl {model_name} get-spec --id {experiment_id} --action inference --job_type experiment\")\n",
    "tao_inference_specs = json.loads(tao_inference_specs)\n",
    "print(json.dumps(tao_inference_specs, indent=4))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true,
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Customize TAO inference specs\n",
    "#Apply changes to the specs dictionary here if required\n",
    "print(json.dumps(tao_inference_specs, indent=4))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Run TAO inference <a class=\"anchor\" id=\"head-29\"></a>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "parent = job_map[\"train_\" + model_name]\n",
    "job_id = subprocess.getoutput(f\"nvtl {model_name} experiment-run-action --action inference --id {experiment_id} --parent_job_id {parent} --specs '{json.dumps(tao_inference_specs)}'\")\n",
    "job_map[\"tlt_inference_\" + model_name] = job_id\n",
    "print(job_id)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# Check status (the file won't exist until the backend Toolkit container is running -- can take several minutes)\n",
    "status = my_tail(model_name, experiment_id, job_id, \"experiment\", workdir)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "if download_jobs:\n",
    "    temptar = subprocess.getoutput(f\"nvtl {model_name} download-entire-job --id {experiment_id} --job {job_id} --job_type experiment --workdir {workdir}\")\n",
    "    tar_command = f'tar -xvf {temptar} -C {workdir}/'\n",
    "    os.system(tar_command)\n",
    "    os.remove(temptar)\n",
    "    print(f\"Results at {workdir}/{job_id}\")\n",
    "    inference_out_path = f\"{workdir}/{job_id}\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "if download_jobs:\n",
    "    !ls {inference_out_path}/inference.json"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Delete model <a class=\"anchor\" id=\"head-21\"></a>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "subprocess.getoutput(f\"nvtl {model_name} experiment-delete --id {experiment_id}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Delete dataset <a class=\"anchor\" id=\"head-21\"></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Delete train dataset <a class=\"anchor\" id=\"head-21\"></a>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "subprocess.getoutput(f\"nvtl {model_name} dataset-delete --id {train_dataset_id}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Delete val dataset <a class=\"anchor\" id=\"head-21\"></a>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "subprocess.getoutput(f\"nvtl {model_name} dataset-delete --id {eval_dataset_id}\")"
   ]
  }
 ],
 "metadata": {
  "datalore": {
   "base_environment": "default",
   "computation_mode": "JUPYTER",
   "package_manager": "pip",
   "packages": [],
   "version": 1
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.5"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
